Introduction
Starting during World War II and the Second Wave of the feminist movement, women have been entering the workforce in greater numbers. As of 2015, women made up almost half of the American workforce at 46.8%. Despite this almost equal participation between men and women, women still earn less than men on average.
The gender wage gap has been and still is a contentious public issue. While there is no debate that women earn less than men, the key aspect of the debate has been this question:
Is there a significant difference in income between men and women and does this difference vary depending on other factors?
This is an important question to answer as, if it turns out that we cannot account for the difference appropriately, it could provide quantitative evidence that gender discrimination plays a significant role in causing the wage gap.
On the other hand, if it turns out that we can account for the difference by other causes, then this would suggest that there is not quantitative evidence supporting gender discrimination as a cause of the wage gap. It should be noted that such a finding is not the same thing as suggesting that gender discrimination does not influence the wage gap at all, only that other factors may play a relatively more significant role.
In our attempt to answer this question, we decided to look into these variables (and their relationship with gender and income) as possible other causes:
- Highest degree: Highest level of education completed as of 2017
- Occupation: Occupation as of 2017
- Industry: Industry as of 2017
- Childhood financial difficulty: Whether one ever experienced financially hard times during childhood
- Disability: Whether one ever struggled with a physical or mental issue that limited their ability to participate in work or school
- Urban vs. rural: Whether one lives in an urban or rural community
- Marital status: Marital status as of 2017
- Total children: Number of biological children in household as of 2017
- Spouse income: Spouse’s income as of 2017
- Incarceration Age: Age of first incarceration, if applicable
Data Cleaning
We will be using the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set. This data set contains survey responses from thousands of individuals who have been surveyed every one or two years starting in 1997.
In the process of studying our variables, we noticed several issues that needed to be rectified prior to conducting both graphical/tabular summaries and the regression analysis.
Alternative Responses
All of the variables had several kinds of alternative responses. These included: Refusal, Don’t Know, Valid Skip, Invalid Skip, and Non-Interview. While we attempted to recode these responses in order to include as much data as we could in our analysis, much of the time this was not possible as there was no clear way to interpret the data. As such, we ended up removing many data points. We removed all alternative responses for these variables:
- Income
- Highest degree
- Occupation
- Industry
- Childhood financial difficulty
- Marital status
- Urban vs. rural
- Spouse income
We removed all the alternative responses except valid skips for these variables:
- Disability
  - Valid skip was recoded as “No”, i.e., they have not struggled with the aforementioned issue
- Total children
  - Valid skip was recoded as “0” because these respondents did not have biological children
- Age incarcerated
  - Valid skips were recoded to never incarcerated
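The recoding rules above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the label strings and the `clean_response` helper are assumptions, and the actual NLSY97 extract may encode alternative responses differently (for example, as negative numeric codes).

```python
# Hypothetical labels; the real NLSY97 extract may encode these differently.
ALTERNATIVE = {"Refusal", "Don't Know", "Valid Skip", "Invalid Skip", "Non-Interview"}

def clean_response(value, valid_skip_as=None):
    """Return a cleaned value, or None if the row should be dropped.

    valid_skip_as: replacement for "Valid Skip" (e.g. "No" or 0);
    leave as None to drop valid skips like any other alternative response.
    """
    if value == "Valid Skip" and valid_skip_as is not None:
        return valid_skip_as
    if value in ALTERNATIVE:
        return None
    return value

# Toy disability responses: the valid skip becomes "No", the refusal is dropped.
disability_raw = ["Yes", "Valid Skip", "Refusal", "No"]
disability = [clean_response(v, valid_skip_as="No") for v in disability_raw]
kept = [v for v in disability if v is not None]
```

For total children, the same helper would be called with `valid_skip_as=0`; for variables such as income, it would be called without a replacement so that valid skips are dropped as well.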
Small Samples
Furthermore, several variables contained factors with fewer than 10 respondents or with no women. We decided to exclude these factors, as their small sample sizes would prevent them from providing a clear picture of the average income within those groups. The excluded factors, grouped by variable, were:
- Occupation
  - Those working as life, physical, and social science technicians
  - Those working in military specific occupations
  - Those working in ACS special codes
  - Those working in engineering or related technicians
  - Those working in entertainment attendants and related workers
  - Those working in farming, fishing, and forestry
  - Those working in food preparation
- Marital status
- Total children
  - Those who have 6 children
  - Those who have 7 children
  - Those who have 8 children
- Highest degree
- Industry
  - Those working in mining
  - Those working in utilities
  - Those working in agriculture, forestries, and fisheries
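The exclusion rule described above (fewer than 10 respondents, or no women) can be expressed as a small filter. The sketch below is a hedged illustration with made-up rows; the `small_groups` helper and the `"Female"` label are assumptions rather than part of the original analysis.

```python
from collections import Counter

def small_groups(rows, min_size=10):
    """rows: iterable of (category, sex) pairs.

    Returns the categories that would be excluded: those with fewer than
    `min_size` respondents in total, or with no female respondents.
    """
    rows = list(rows)
    totals = Counter(cat for cat, _ in rows)
    women = Counter(cat for cat, sex in rows if sex == "Female")
    return sorted(cat for cat in totals
                  if totals[cat] < min_size or women[cat] == 0)

# Toy data (not from the survey): 12 men in mining, 12 teachers, 3 in utilities.
toy = ([("Mining", "Male")] * 12
       + [("Teachers", "Female")] * 8 + [("Teachers", "Male")] * 4
       + [("Utilities", "Female")] * 3)
excluded = small_groups(toy)
```

In this toy example, mining is flagged because it has no women and utilities because it has fewer than 10 respondents, while teachers are kept.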
Topcoding
In our dataset, the income variable has been topcoded for the top 2% of earners. This means that instead of each earner in the top 2% having their own unique data point (as is the case for the remaining 98%), they all instead share the average of their group which is $235884. This can pose some issues for both the graphical/tabular summaries and the regression analysis as it can skew the data. As such, throughout both of these sections, we have reviewed the results both with and without the topcoding and then decided which results to ultimately present. For our graphical summaries, we noticed that removing topcoded values gave us a clearer understanding of our results. However, we did not notice a significant difference in our regression analysis regardless of whether we did or did not include topcoded values.
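| Term | Estimate | Std. Error | t value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 57202.82 | 779.271 | 73.406 | 0 |
| as.factor(sex)2 | -15923.90 | 1118.772 | -14.233 | 0 |

| Term | Estimate | Std. Error | t value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 56108.74 | 1003.466 | 55.915 | 0 |
| as.factor(sex)Female | -14354.72 | 1387.154 | -10.348 | 0 |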
The spouse income variable also had topcoding at the 2% level. See the spouse income section under graphical and tabular summaries to see how this was handled.
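To make the topcoding mechanism concrete, the following sketch replaces the top 2% of a list of incomes with the average of that group, which is how the topcoded value is described above. The `topcode` function and the toy incomes are illustrative assumptions; the survey's own procedure may handle ties and cutoffs differently.

```python
def topcode(values, share=0.02):
    """Replace the top `share` of values with the mean of that top group."""
    ordered = sorted(values)
    k = max(1, round(len(ordered) * share))  # number of topcoded earners
    cutoff = ordered[-k]                     # smallest value in the top group
    top_mean = sum(ordered[-k:]) / k
    # Note: values tied with the cutoff are also replaced in this sketch.
    return [top_mean if v >= cutoff else v for v in values]

# Toy incomes: 1000, 2000, ..., 100000 (100 values, so 2 are topcoded).
incomes = list(range(1000, 101000, 1000))
coded = topcode(incomes)
```

Because the replaced values share their group mean, the overall mean of the toy data is unchanged, but the individual variation among the highest earners is lost. That loss is why results were checked both with and without the topcoded rows.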
Graphical and Tabular Summaries
In order to better understand how each variable relates to income and gender, we have created several graphical and tabular summaries for each of them.
Highest degree
We see that men are more likely than women to have finished their education at lower levels, such as None and GED. In turn, women are more likely than men to have Associate/Junior college degrees, Bachelor’s Degrees, Master’s Degrees and Professional Degrees (DDS, JD, MD). However, there is some uncertainty as these differences may not be great enough to be statistically significant.
| Highest degree | Sex | Count | Mean income |
| --- | --- | --- | --- |
| None | Male | 47 | 37961.70 |
| None | Female | 24 | 22312.50 |
| High School Diploma | Male | 314 | 48866.85 |
| High School Diploma | Female | 326 | 31655.10 |
| GED | Male | 77 | 41310.73 |
| GED | Female | 52 | 26876.69 |
| Associate/Junior College | Male | 60 | 58579.73 |
| Associate/Junior College | Female | 68 | 43258.82 |
| Bachelor’s Degree | Male | 211 | 68167.04 |
| Bachelor’s Degree | Female | 293 | 51839.94 |
| Master’s Degree | Male | 53 | 79814.60 |
| Master’s Degree | Female | 72 | 59119.94 |
| Professional Degree (DDS, JD, MD) | Male | 5 | 119577.00 |
| Professional Degree (DDS, JD, MD) | Female | 7 | 73842.86 |
Men have a higher average salary than women across every educational group.
Occupation
| Occupation | Sex | Count | Mean income |
| --- | --- | --- | --- |
| OFFICE AND ADMINISTRATIVE SUPPORT WORKERS | Male | 62 | 50785.05 |
| OFFICE AND ADMINISTRATIVE SUPPORT WORKERS | Female | 162 | 35095.52 |
| EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL | Male | 105 | 69064.24 |
| EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL | Female | 105 | 56774.56 |
| MANAGEMENT RELATED | Male | 52 | 74116.77 |
| MANAGEMENT RELATED | Female | 59 | 57536.59 |
| MATHEMATICAL AND COMPUTER SCIENTISTS | Male | 45 | 69774.00 |
| MATHEMATICAL AND COMPUTER SCIENTISTS | Female | 11 | 70272.73 |
| ENGINEERS, ARCHITECTS, AND SURVEYORS | Male | 17 | 83147.06 |
| ENGINEERS, ARCHITECTS, AND SURVEYORS | Female | 2 | 101500.00 |
| PHYSICAL SCIENTISTS | Male | 6 | 50750.00 |
| PHYSICAL SCIENTISTS | Female | 5 | 44000.00 |
| SOCIAL SCIENTISTS AND RELATED WORKERS | Male | 3 | 61333.33 |
| SOCIAL SCIENTISTS AND RELATED WORKERS | Female | 7 | 59397.00 |
| COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS | Male | 14 | 49907.14 |
| COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS | Female | 29 | 48961.14 |
| LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS | Male | 5 | 74600.00 |
| LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS | Female | 8 | 52250.00 |
| TEACHERS | Male | 27 | 50851.85 |
| TEACHERS | Female | 91 | 39624.18 |
| EDUCATION, TRAINING, AND LIBRARY WORKERS | Male | 4 | 50000.00 |
| EDUCATION, TRAINING, AND LIBRARY WORKERS | Female | 11 | 20272.73 |
| ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS | Male | 11 | 50909.09 |
| ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS | Female | 13 | 58315.38 |
| MEDIA AND COMMUNICATION WORKERS | Male | 6 | 67166.67 |
| MEDIA AND COMMUNICATION WORKERS | Female | 14 | 39363.43 |
| HEALTH DIAGNOSIS AND TREATING PRACTITIONERS | Male | 9 | 75542.78 |
| HEALTH DIAGNOSIS AND TREATING PRACTITIONERS | Female | 58 | 60834.48 |
| HEALTH CARE TECHNICAL AND SUPPORT | Male | 15 | 44600.00 |
| HEALTH CARE TECHNICAL AND SUPPORT | Female | 53 | 28562.47 |
| PROTECTIVE SERVICE | Male | 35 | 65165.91 |
| PROTECTIVE SERVICE | Female | 11 | 31590.91 |
| FOOD PREPARATIONS AND SERVING RELATED | Male | 25 | 28512.00 |
| FOOD PREPARATIONS AND SERVING RELATED | Female | 33 | 25348.48 |
| CLEANING AND BUILDING SERVICE | Male | 17 | 44529.41 |
| CLEANING AND BUILDING SERVICE | Female | 12 | 14166.67 |
| PERSONAL CARE AND SERVICE WORKERS | Male | 8 | 35625.00 |
| PERSONAL CARE AND SERVICE WORKERS | Female | 46 | 22007.98 |
| SALES AND RELATED WORKERS | Male | 65 | 55674.63 |
| SALES AND RELATED WORKERS | Female | 77 | 38892.27 |
| CONSTRUCTION TRADES AND EXTRACTION WORKERS | Male | 68 | 50305.82 |
| CONSTRUCTION TRADES AND EXTRACTION WORKERS | Female | 3 | 26166.67 |
| INSTALLATION, MAINTENANCE, AND REPAIR WORKERS | Male | 51 | 49820.88 |
| INSTALLATION, MAINTENANCE, AND REPAIR WORKERS | Female | 3 | 55333.33 |
| PRODUCTION AND OPERATING WORKERS | Male | 15 | 45266.67 |
| PRODUCTION AND OPERATING WORKERS | Female | 5 | 29400.00 |
| SETTER, OPERATORS, AND TENDERS | Male | 30 | 47100.00 |
| SETTER, OPERATORS, AND TENDERS | Female | 13 | 36153.85 |
| TRANSPORTATION AND MATERIAL MOVING WORKERS | Male | 72 | 42699.72 |
| TRANSPORTATION AND MATERIAL MOVING WORKERS | Female | 11 | 23909.09 |
Men on average have a higher income than women in all occupations except:
- Engineers, architects and surveyors (though this small sub-sample only has 19 respondents, of whom 2 are women)
- Entertainers and performers, sports and related workers (though this small sub-sample only has 24 respondents)
- Installation, maintenance, and repair workers (though there are only 3 women in this occupation)
- Mathematical and computer scientists (though the difference is slight)
It should be noted that the reasons provided in parentheses suggest there is statistical uncertainty in said comparisons.
Furthermore, we see that men and women, on average, tend to choose different careers. These differences appear to be statistically significant for certain careers. Men appear to be more prevalent in:
- Construction trades and extraction workers
- Installation, maintenance, and repair workers
- Mathematical and computer scientists
- Production and operating workers
- Protective service
- Setter, operators, and tenders
- Transportation and material moving workers
While women are more prevalent in:
- Counselors, social and religious workers
- Healthcare technical and support
- Health diagnosis and treating practitioners
- Office and Administrative support workers
- Personal care and service workers
- Teachers
In all other occupations, men and women appear to be present in around the same numbers.
As men and women tend to choose different career paths, it seems that women are more likely to be in occupations with lower salaries while men are more likely to be in occupations with higher salaries. For example, the occupations of Office and Administrative Support Workers and Personal Care and Service Workers, in both of which women are prevalent, have average salaries of $57137.11 and $26712.07, respectively. On the other hand, men are more prevalent in the occupations of Construction Trades and Extraction Workers and Mathematical and Computer Scientists, which have average salaries of $62919.4 and $62810.22, respectively.
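An occupation-level average can also be recovered directly from the counts and means reported for men and women in the occupation table, as a count-weighted (pooled) mean. The sketch below is illustrative only: `pooled_mean` is a hypothetical helper, and figures computed this way reflect only the rows shown in that table, so they need not match averages computed on a different sample (for example, one that still includes topcoded rows).

```python
def pooled_mean(groups):
    """groups: list of (count, mean) pairs. Returns the count-weighted mean."""
    total = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total

# (count, mean income) for men and women, copied from the occupation table.
office_admin = pooled_mean([(62, 50785.05), (162, 35095.52)])
construction = pooled_mean([(68, 50305.82), (3, 26166.67)])
```

On the table's figures this gives roughly $39,438 for Office and Administrative Support Workers and roughly $49,286 for Construction Trades and Extraction Workers.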
Industry
| Industry | Sex | Count | Mean income |
| --- | --- | --- | --- |
| EDUCATIONAL, HEALTH, AND SOCIAL SERVICES | Male | 94 | 54855.04 |
| EDUCATIONAL, HEALTH, AND SOCIAL SERVICES | Female | 348 | 40014.54 |
| CONSTRUCTION | Male | 92 | 52676.85 |
| CONSTRUCTION | Female | 12 | 39125.00 |
| MANUFACTURING | Male | 92 | 52863.33 |
| MANUFACTURING | Female | 43 | 48290.70 |
| WHOLESALE TRADE | Male | 33 | 59909.09 |
| WHOLESALE TRADE | Female | 11 | 49636.36 |
| RETAIL TRADE | Male | 54 | 46609.80 |
| RETAIL TRADE | Female | 66 | 31904.55 |
| TRANSPORTATION AND WAREHOUSING | Male | 45 | 50041.78 |
| TRANSPORTATION AND WAREHOUSING | Female | 13 | 47153.85 |
| INFORMATION AND COMMUNICATION | Male | 11 | 57590.91 |
| INFORMATION AND COMMUNICATION | Female | 12 | 53416.67 |
| FINANCE, INSURANCE, AND REAL ESTATE | Male | 60 | 66600.27 |
| FINANCE, INSURANCE, AND REAL ESTATE | Female | 71 | 50113.69 |
| PROFESSIONAL AND RELATED SERVICES | Male | 106 | 58773.08 |
| PROFESSIONAL AND RELATED SERVICES | Female | 84 | 46411.90 |
| ACS SPECIAL CODES | Male | 42 | 74471.02 |
| ACS SPECIAL CODES | Female | 42 | 59523.81 |
| ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | Male | 49 | 37975.51 |
| ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | Female | 69 | 27030.41 |
| OTHER SERVICES | Male | 28 | 48131.18 |
| OTHER SERVICES | Female | 28 | 37039.29 |
| PUBLIC ADMINISTRATION | Male | 61 | 69308.31 |
| PUBLIC ADMINISTRATION | Female | 43 | 44680.40 |
According to the above table, men on average have a higher income than women in all industries. To see which of these differences in average incomes are statistically significant, see the corresponding graph in the Findings from Regression Analysis section.
Men and women also tend to choose different industries in some instances. The industries with a statistically significantly greater number of men are:
- Construction
- Manufacturing
- Professional and Related Services
- Public Administration
- Transportation and Warehousing
- Wholesale Trade
Women are more prevalent in:
- Educational, Health, and Social Services
In all other industries, men and women appear to be present in around the same numbers.
Similar to occupations, women seem to be more likely than men to be in less lucrative industries. While the female-dominant industry of educational, health, and social services has an average income of around $51406.86, the male-dominant industries of construction and manufacturing tend to have average salaries of $51113.17 and $55413.04.
Disability
| Disability | Sex | Count | Mean income |
| --- | --- | --- | --- |
| No | Male | 722 | 56526.67 |
| No | Female | 803 | 42215.67 |
| Yes | Male | 45 | 49403.33 |
| Yes | Female | 39 | 32248.77 |
Men appear to be more likely than women to have (or have ever had) some sort of disability. However, this difference does not appear to be statistically significant, as it amounts to only 6 respondents (45 men versus 39 women).
| Disability | Count | Mean income |
| --- | --- | --- |
| No | 1525 | 48991.11 |
| Yes | 84 | 41438.71 |
While those without disabilities do appear to on average have higher incomes than those with, this difference does not hold when including sex. For example, men who have (or have ever had) disabilities have a higher average income than women who have never had a disability.
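The judgment that a difference of 6 respondents is not statistically significant can be checked more formally with a two-proportion z-test. The sketch below is one possible check under a normal approximation, not a test reported in the original analysis; `two_proportion_z` is a hypothetical helper, and the counts are taken from the disability-by-sex table above.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

# Men: 45 of 722 + 45 report a disability; women: 39 of 803 + 39.
z, p_value = two_proportion_z(45, 722 + 45, 39, 803 + 39)
```

Under these assumptions the p-value is well above 0.05, which is consistent with the informal reading above. The same function could be applied to the childhood financial difficulty counts.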
Childhood financial difficulty
| Childhood financial difficulty | Sex | Count | Mean income |
| --- | --- | --- | --- |
| No | Male | 726 | 56904.14 |
| No | Female | 813 | 41999.24 |
| Yes | Male | 41 | 42024.39 |
| Yes | Female | 29 | 34879.31 |
Men appear to be more likely than women to have had childhood financial difficulties. However, this difference does not appear to be statistically significant, as it amounts to only 12 respondents (41 men versus 29 women).
| Childhood financial difficulty | Count | Mean income |
| --- | --- | --- |
| No | 1539 | 49030.40 |
| Yes | 70 | 39064.29 |
Those without childhood financial difficulties appear to have higher incomes on average than those with, and this difference holds within each sex. Across sexes, however, men who have had childhood financial difficulties have almost the same average income as women who have never had them: the men's average is higher by only about $25, a difference that is likely statistically insignificant.
Incarceration Age
| Age at first incarceration | Sex | Count | Mean income |
| --- | --- | --- | --- |
| 16 TO 18: years | Male | 15 | 40991.33 |
| 16 TO 18: years | Female | 1 | 60000.00 |
| 19 TO 21: years | Male | 22 | 37772.73 |
| 19 TO 21: years | Female | 6 | 34666.67 |
| 22 TO 25: years | Male | 25 | 43672.00 |
| 22 TO 25: years | Female | 6 | 25500.00 |
| 26 TO 30: years | Male | 11 | 29118.18 |
| 26 TO 30: years | Female | 4 | 23625.00 |
| 31 TO 99: years | Male | 5 | 55000.00 |
| Invalid Skip | Male | 1 | 12000.00 |
| Valid Skip | Male | 688 | 57980.28 |
| Valid Skip | Female | 825 | 41989.56 |
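Men outnumber women in being incarcerated regardless of age. Furthermore, within each sex, those who were never incarcerated (Valid Skip) earn more on average than those who were; the only exception is the single woman first incarcerated between the ages of 16 and 18. Across sexes, men who were first incarcerated between the ages of 22 and 25, or between 31 and 99, earn more on average than women who were never incarcerated, though these groups are small.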
Urban vs. Rural
| Urban vs. rural | Mean income | Count |
| --- | --- | --- |
| Rural | 46556.64 | 412 |
| Urban | 49299.04 | 1197 |

The first table shows that urban dwellers have the higher average income. The violin plot confirms this, but the thin density of individuals near the top of the income range suggests that there may be outliers.
The table also shows that the majority of our non-missing respondents are urban dwellers (1197 versus 412). This means that the urban average is likely to be a closer approximation to the true value than the rural average. However, since the rural sample is still large, we can still generalize from it.
The above violin plot shows that those in rural settings have a higher concentration of people at a lower income. Both female and male urban dwellers have a higher maximum income than their rural counterparts.
Total Children
| Total children | Sex | Count | Mean income |
| --- | --- | --- | --- |
| 0 | Male | 297 | 52624.14 |
| 0 | Female | 230 | 47496.58 |
| 1 | Male | 208 | 56235.50 |
| 1 | Female | 185 | 44808.45 |
| 2 | Male | 163 | 61102.33 |
| 2 | Female | 259 | 39855.14 |
| 3 | Male | 78 | 59915.00 |
| 3 | Female | 128 | 35954.13 |
| 4 | Male | 17 | 49705.88 |
| 4 | Female | 28 | 28982.14 |
| 5 | Male | 4 | 57750.00 |
| 5 | Female | 12 | 17250.00 |
We used the first table to drop all number-of-children groupings with fewer than 10 respondents; the above table confirms that each remaining group has at least 10 respondents in total. Though we have not tested for statistical significance, the table indicates that the income gap between men and women tends to widen as the number of biological children increases. Running the regression will allow us to see whether these differences are statistically significant.
The bar chart shows that there are more women than men among respondents in every category except the 0 and 1 categories. This may be due to women having a higher likelihood of being single parents. This graph, together with the first table, indicates that there is still a large enough sample size across categories.
Spouse Income
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 28000 42000 50926 64250 283919
The above table summarizes the distribution of spouse income before removing topcoding. It shows that the data is highly skewed to the right with a mean of $50,926 and a median of $42,000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 28000 42000 47961 62000 173000
Similar to the income variable, spouse’s income is topcoded for the top 2% of values. The above table summarizes the distribution after removing the topcoded values, which leaves a maximum of $173000. We see that though the average spouse income is now $47,961, the data is still skewed to the right, with a median of $42,000. The mean is closer to the median than before, so this data is less skewed than the data that included the topcoded values. This suggests that our analysis will be less influenced by the highest incomes and will better reflect where the majority of respondents lie, improving the generalizability of our results.
| Marital status | Sex | Count | Mean income |
| --- | --- | --- | --- |
| Never married | Male | 139 | 45939.11 |
| Never married | Female | 133 | 39477.95 |
| Married | Male | 573 | 59536.42 |
| Married | Female | 636 | 42752.24 |
| Separated | Male | 9 | 41888.89 |
| Separated | Female | 12 | 56125.00 |
| Divorced | Male | 46 | 46923.85 |
| Divorced | Female | 59 | 34193.14 |
| Widowed | Female | 2 | 12500.00 |
After we remove the topcoded values, we see that an increase in spousal income is associated with an increase in individual income. The correlation coefficient of 0.19 between income and spousal income indicates a weak but positive correlation overall. This coincides with the graphs of spousal income versus income for both men and women. The correlation between spouse income and income appears to be stronger for men than for women, but we would have to test this to see whether the difference is statistically significant.
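For reference, the correlation coefficient mentioned above is the Pearson correlation. The sketch below shows how it is computed, using made-up toy values rather than the survey data; `pearson_r` is a hypothetical helper written for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy values only (not from the NLSY97 data).
spouse_income = [20000, 40000, 60000, 80000]
own_income = [30000, 50000, 40000, 60000]
r = pearson_r(spouse_income, own_income)
```

Computing the coefficient separately for men and for women would give the two values whose difference the paragraph above suggests testing.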
Marital Status
| Marital status | Sex | Count | Mean income |
| --- | --- | --- | --- |
| Never married | Male | 199 | 46756.46 |
| Never married | Female | 181 | 39019.91 |
| Married | Male | 746 | 58716.65 |
| Married | Female | 892 | 41847.40 |
| Separated | Male | 12 | 41833.33 |
| Separated | Female | 21 | 56166.67 |
| Divorced | Male | 61 | 48093.39 |
| Divorced | Female | 78 | 35716.60 |
| Widowed | Female | 3 | 25000.00 |
| NA | Male | 6 | 53500.00 |
| NA | Female | 1 | 21000.00 |
The first table shows that, after removing invalid and topcoded values from our analysis, we only had 3 widowed females and no widowed males. The above graph shows that it is hard to determine whether the differences in income by marital status are statistically significant. With so few widowed respondents, the other variables could absorb the effect that the widowed factor would have had on income when we run a regression, which could inflate standard errors or distort the estimated effect sizes.
We see in the count of respondents table that the majority of respondents are married. By looking at average income in the tabular summary by marital status and sex, we can see that there does seem to be an income gap between men and women based on marital status.
Methodology for Linear Regression Model
Missing Values
As mentioned in the Data Cleaning section, in the majority of the cases, we unfortunately had to simply remove the missing values. We had to do this as the dataset did not provide clear notes on how to interpret missing values. For example, in the case of the childhood financial difficulty variable, 1084 of the responses were marked as “Valid Skip” but there was no note as to how this response differed from the response of “No.” In order to avoid conflating “No” with the possibly unclear response of “Valid Skip”, we simply had to drop those 1084 rows. Dropping these rows will weaken the resulting interpretation and generalizability of our analysis, not only because it reduces our sample size but, more importantly, because it could make our overall sample less random.
This would be the case if the respondents with the “Valid Skip” value were not a random sub-sample, and we have no reason to believe that they were. While it is likely that the initial sampling was done randomly from the overall population, it is unlikely that those with the “Valid Skip” response are also a random sub-sample of the dataset. In turn, if our overall sample is no longer random (or is less random), we are less able to generalize our results and make predictions regarding the general population.
Topcoded Variables
For the topcoded variables of income and spouse income, we ended up removing the topcoded rows entirely. We did this for two reasons. Firstly, it made our graphs clearer as it reduced their skewness. For example, for our violin plots describing the urban vs. rural variable, we noticed that the presence of the topcoded rows stretched our plots towards higher incomes, thereby making it harder to interpret the larger distributions near the lower end of the income axis. Secondly, we noticed that removing the topcoded rows did not significantly change our regression results in terms of which variables were and were not significant.
Plots We Tried
We produced a scatter plot of spouse income and the individual’s income. We expected to find that as a man’s income increased, his spouse’s income would decrease. Likewise, we expected a woman’s income to decrease as her spouse’s income increased. This did not hold true based on our graph of statistical significance.
We also produced a scatter plot of age at first incarceration and income. We expected that those who had an earlier age of first incarceration would have a lower income than those who were incarcerated at a later age. Instead, we found that the overall trend of the scatter plot was a horizontal line, suggesting that regardless of when one is first incarcerated, they can expect to have a similar, below average income.
Variable Selection
To begin our variable selection process, we constructed linear regression model comparisons with and without each variable. From this, we found that the urban-rural variable is the only one that is not a significant predictor when added to the regression of income on sex.
To narrow down what variables to use in our analysis, we examined whether there was collinearity between variables. If we have variables with collinearity, then we have difficulty interpreting how those variables uniquely determine the results as they are associated with another variable. Collinearity therefore impacts the accuracy of our interpretations.
To determine collinearity, we run the following pairs plot.

The above plot shows that there is no two-variable combination in which a large proportion of respondents falls into only one of the combined categories. For example, for men and women, we do not see all women fall into rural and all men fall into urban. This statement holds for all the variables presented in the plot. Since there do not seem to be any cases in which knowing the value of one variable means we know the value of the other, we can confidently use a combination of these variables in our regression analyses.
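The visual check described above can be complemented by a cross-tabulation that asks whether one categorical variable completely determines another. The sketch below is a minimal illustration with toy pairs; `perfectly_determined` is a hypothetical helper and only detects the extreme case of perfect collinearity, not weaker associations.

```python
from collections import Counter

def perfectly_determined(pairs):
    """True if every level of the first variable occurs with exactly one
    level of the second variable (i.e. the first determines the second)."""
    seen = {}
    for a, b in pairs:
        seen.setdefault(a, set()).add(b)
    return all(len(levels) == 1 for levels in seen.values())

# Toy pairs of (sex, urban/rural); not taken from the survey.
mixed = [("Male", "Urban"), ("Male", "Rural"),
         ("Female", "Urban"), ("Female", "Rural")]
separated = [("Male", "Urban"), ("Male", "Urban"), ("Female", "Rural")]

crosstab = Counter(mixed)  # cell counts of the two-way table
```

Here `perfectly_determined(mixed)` is false, matching the situation described above, whereas `perfectly_determined(separated)` is true and would signal a problem for the regression.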
After doing the graphical and tabular summaries along with the above pairs plot, we felt that the following five variables showed the greatest variation in income between genders.
- Highest degree
- Spouse income
- Total children
- Occupation
- Industry
While the other variables provided informative descriptions, we decided to not include them either because they were most applicable only to a small sub-sample of the dataset (such as childhood financial difficulty, disability, and age incarcerated), or because we simply did not want to over-complicate the following regressions with too many variables.
Findings from Regression Analysis
We utilized the above methodology to determine what variables we want to test and then include in our analysis.
For our variables, we used the following baselines:
- Sex: Male, to highlight how much women are disadvantaged in terms of income
- Highest degree: None, to see the impact of educational capital
- Total children: 0, to see the association between the number of children and income
- Occupation: Office and Administrative Support Workers, as it is the most populous category and so will improve generalizability
- Industry: Educational, Health, and Social Services, as it is the most populous category and so will improve generalizability
Regressing Income on Sex
We first begin with an assessment of the relationship between sex and income. We start with the following model:
Income = Intercept + \(\beta\) * sex
| Term | Estimate | Std. Error | t value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 56108.74 | 1003.466 | 55.915 | 0 |
| sexFemale | -14354.72 | 1387.154 | -10.348 | 0 |
Male is the baseline for sex, so the sexFemale coefficient measures the average difference for women relative to men. From the above model, we see that women tend to earn an average of $14354.72 less than men. With a p-value reported as 0 (that is, smaller than the displayed precision), this is significant at the 5% significance level. The next step is to determine whether this statistically significant difference holds when including the effects of other variables.
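With a single binary predictor, the regression above reduces to a comparison of group means: the intercept is the average income of the baseline group and the coefficient is the difference between the two group averages. The sketch below demonstrates this with a closed-form least-squares fit on toy data; `simple_ols` and the incomes are illustrative assumptions, not the fitted model from the analysis.

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data: female = 0 for men, 1 for women (values are made up).
female = [0, 0, 0, 1, 1]
income = [50000, 60000, 58000, 40000, 44000]
intercept, slope = simple_ols(female, income)

men_mean = sum(i for f, i in zip(female, income) if f == 0) / 3
women_mean = sum(i for f, i in zip(female, income) if f == 1) / 2
```

In the toy data the intercept equals the men's mean and the slope equals the women's mean minus the men's mean. Read the same way, the table above implies an average income of 56108.74 for men and 56108.74 - 14354.72 = 41754.02 for women.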
Adding highest degree and urban vs. rural to determine significance
In this section, we look at the impact of adding the highest degree and urban vs. rural variables to our regression to determine whether they change the effect size that sex has on income. First we look at the regression model:
| Term | Estimate | Std. Error | t value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 36420.216 | 3212.338 | 11.338 | 0.0000 |
| sexFemale | -16952.485 | 1258.239 | -13.473 | 0.0000 |
| highest.degree.attained.2017High School Diploma | 10470.646 | 3135.558 | 3.339 | 0.0009 |
| highest.degree.attained.2017GED | 4087.557 | 3696.377 | 1.106 | 0.2690 |
| highest.degree.attained.2017Associate/Junior College | 21386.823 | 3712.514 | 5.761 | 0.0000 |
| highest.degree.attained.2017Bachelor’s Degree | 30283.244 | 3184.942 | 9.508 | 0.0000 |
| highest.degree.attained.2017Master’s Degree | 39244.067 | 3727.263 | 10.529 | 0.0000 |
| highest.degree.attained.2017Professional Degree (DDS, JD, MD) | 64721.585 | 7811.357 | 8.286 | 0.0000 |
| urban.ruralUrban | 2468.847 | 1432.018 | 1.724 | 0.0849 |
Through this table we see that the sexFemale coefficient is statistically significant at the 5% significance level. We also see that the coefficients for High School Diploma, Associate/Junior College, Bachelor’s, Master’s, and Professional Degrees are statistically significant at the 5% significance level in comparison to the baseline of no degree attained; only the GED coefficient is not (p-value of 0.2690). This indicates that gender in addition to highest degree attained can influence income. The urban coefficient suggests that those in urban areas earn $2468.85 more on average than those in rural settings, all other factors being constant. However, this difference is not statistically significant at the 5% level.
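To illustrate how the coefficients in this model combine, the sketch below adds up the relevant terms to obtain the predicted average income for a given profile. The coefficient values are copied from the table above; the `predict` helper is a hypothetical convenience, and the result is a model prediction for a group average, not a statement about any individual.

```python
# Coefficients copied from the regression table above.
coef = {
    "(Intercept)": 36420.216,
    "sexFemale": -16952.485,
    "highest.degree.attained.2017Bachelor’s Degree": 30283.244,
    "urban.ruralUrban": 2468.847,
}

def predict(active_terms):
    """Sum the intercept and the coefficients of the dummy terms that apply."""
    return coef["(Intercept)"] + sum(coef[t] for t in active_terms)

bachelor_urban = ["highest.degree.attained.2017Bachelor’s Degree",
                  "urban.ruralUrban"]
male_ba_urban = predict(bachelor_urban)
female_ba_urban = predict(bachelor_urban + ["sexFemale"])
```

For an urban respondent with a Bachelor’s Degree, this gives about $69,172 for men and about $52,220 for women. The gap between them is exactly the sexFemale coefficient, because the model has no interaction terms.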
To check whether highest degree attained contributes significantly to the above model, we remove the urban-rural variable and compare the resulting model to the regression of income on sex alone.
By running an ANOVA on the models with and without education, we obtain a p-value reported as 0, which is statistically significant; highest degree attained is therefore an important variable for modeling and explaining income.
Next, we run the ANOVA below to determine whether the urban-rural variable is significant.
## Analysis of Variance Table
##
## Model 1: income ~ sex + urban.rural
## Model 2: income ~ sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1606 1.2392e+12
## 2 1607 1.2411e+12 -1 -1900207595 2.4626 0.1168
The urban coefficient was not significant in the model. The ANOVA confirms this, as the p-value is 0.1168, which is not statistically significant at the 5% level. The urban-rural variable therefore is not a significant variable in modeling income, and we removed it from our next regression.
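The F statistic in the ANOVA table above can be reproduced from the printed quantities: it is the reduction in residual sum of squares per added parameter, divided by the residual mean square of the larger model. The sketch below is a hedged illustration using the rounded values shown in the output; `nested_f` is a hypothetical helper, and the p-value lookup (which requires the F distribution) is omitted.

```python
def nested_f(rss_full, df_full, ss_diff, df_diff=1):
    """F statistic for comparing two nested linear models.

    rss_full: residual sum of squares of the larger model
    df_full:  residual degrees of freedom of the larger model
    ss_diff:  reduction in RSS gained by the extra terms
    df_diff:  number of extra parameters in the larger model
    """
    return (ss_diff / df_diff) / (rss_full / df_full)

# Values read off the ANOVA output above (RSS is rounded in the printout).
f_urban = nested_f(rss_full=1.2392e12, df_full=1606,
                   ss_diff=1900207595, df_diff=1)
```

This gives approximately 2.46, matching the reported F of 2.4626 up to the rounding of the printed RSS.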
Adding occupation and determining significance
In this section, we look at the impact of adding occupation to determine if it changes the effect size that sex has on income. First we look at the regression model:
| Term | Estimate | Std. Error | t value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 49317.101 | 1970.242 | 25.031 | 0.0000 |
| sexFemale | -13659.769 | 1433.295 | -9.530 | 0.0000 |
| occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL | 20432.184 | 2429.861 | 8.409 | 0.0000 |
| occupation.2017MANAGEMENT RELATED | 23247.380 | 2923.719 | 7.951 | 0.0000 |
| occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS | 23238.032 | 3821.893 | 6.080 | 0.0000 |
| occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS | 37199.717 | 6057.151 | 6.141 | 0.0000 |
| occupation.2017PHYSICAL SCIENTISTS | 4573.703 | 7753.942 | 0.590 | 0.5554 |
| occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS | 20222.637 | 8105.136 | 2.495 | 0.0127 |
| occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS | 9164.441 | 4175.717 | 2.195 | 0.0283 |
| occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS | 19935.065 | 7155.707 | 2.786 | 0.0054 |
| occupation.2017TEACHERS | 3410.348 | 2853.296 | 1.195 | 0.2322 |
| occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS | -11099.937 | 6688.105 | -1.660 | 0.0972 |
| occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS | 13002.774 | 5392.314 | 2.411 | 0.0160 |
| occupation.2017MEDIA AND COMMUNICATION WORKERS | 7949.137 | 5852.422 | 1.358 | 0.1746 |
| occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS | 25317.998 | 3497.833 | 7.238 | 0.0000 |
| occupation.2017HEALTH CARE TECHNICAL AND SUPPORT | -6570.354 | 3472.982 | -1.892 | 0.0587 |
| occupation.2017PROTECTIVE SERVICE | 11086.474 | 4118.175 | 2.692 | 0.0072 |
| occupation.2017FOOD PREPARATIONS AND SERVING RELATED | -14833.094 | 3701.140 | -4.008 | 0.0001 |
| occupation.2017CLEANING AND BUILDING SERVICE | -11699.265 | 4968.747 | -2.355 | 0.0187 |
| occupation.2017PERSONAL CARE AND SERVICE WORKERS | -13655.686 | 3806.137 | -3.588 | 0.0003 |
| occupation.2017SALES AND RELATED WORKERS | 4664.295 | 2702.431 | 1.726 | 0.0845 |
| occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS | 545.931 | 3552.038 | 0.154 | 0.8779 |
| occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS | 1568.905 | 3920.260 | 0.400 | 0.6891 |
| occupation.2017PRODUCTION AND OPERATING WORKERS | -4602.159 | 5891.500 | -0.781 | 0.4348 |
| occupation.2017SETTER, OPERATORS, AND TENDERS | -1396.706 | 4218.488 | -0.331 | 0.7406 |
| occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS | -7297.373 | 3331.757 | -2.190 | 0.0287 |
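Through this table we see that the sexFemale coefficient remains statistically significant, with a p-value that rounds to 0: women are predicted to earn about $13660 less than men in the same occupation group. Many of the occupation levels are individually significant as well (for example executive, administrative and managerial; engineers, architects, and surveyors; and health diagnosis and treating practitioners), while others (such as teachers, physical scientists, and construction trades) are not. The lack of significance for some occupation levels may be due to collinearity. In any case, the individual p-values only compare each occupation with the baseline occupation, so they do not tell us whether occupation as a whole improves the model.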
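The t values in the coefficient table are simply each estimate divided by its standard error, and a level counts as significant at the 5% level when its p-value is below 0.05. As a minimal Python sketch (not the R code behind the report), the check below uses a handful of rows copied from the table; the selection of rows is arbitrary and only for illustration.

```python
# A few rows copied from the coefficient table above:
# term -> (estimate, standard error, reported p-value)
rows = {
    "sexFemale": (-13659.769, 1433.295, 0.0000),
    "ENGINEERS, ARCHITECTS, AND SURVEYORS": (37199.717, 6057.151, 0.0000),
    "TEACHERS": (3410.348, 2853.296, 0.2322),
    "FOOD PREPARATIONS AND SERVING RELATED": (-14833.094, 3701.140, 0.0001),
    "CONSTRUCTION TRADES AND EXTRACTION WORKERS": (545.931, 3552.038, 0.8779),
}

ALPHA = 0.05  # significance level used throughout the report

# t value = estimate / standard error
t_values = {term: est / se for term, (est, se, _) in rows.items()}

# Terms whose reported p-value falls below the 5% level
significant = [term for term, (_, _, p) in rows.items() if p < ALPHA]

for term, t in t_values.items():
    flag = "significant" if term in significant else "not significant"
    print(f"{term}: t = {t:.3f} ({flag})")
```

The recomputed t values agree with the table to three decimals, and the mix of significant and non-significant rows illustrates why a joint test of the whole occupation variable is still needed.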
Therefore, we run an ANOVA below to determine whether adding occupation is significant.
## Analysis of Variance Table
##
## Model 1: income ~ sex + occupation.2017
## Model 2: income ~ sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1583 9.9547e+11
## 2 1607 1.2411e+12 -24 -2.4566e+11 16.277 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not all of the occupation levels were individually significant, so we tested removing the variable. The ANOVA test gives a p-value below 2.2e-16, which tells us that the occupation.2017 variable is an important variable in modeling income.
For the next regression, we will be looking at the impact of these variables on income: gender, highest degree, occupation, total children, and industry.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 31150.119 | 3850.066 | 8.091 | 0.0000 |
| sexFemale | -12400.533 | 1364.599 | -9.087 | 0.0000 |
| highest.degree.attained.2017High School Diploma | 6753.478 | 3016.304 | 2.239 | 0.0253 |
| highest.degree.attained.2017GED | 859.210 | 3508.766 | 0.245 | 0.8066 |
| highest.degree.attained.2017Associate/Junior College | 12912.009 | 3596.958 | 3.590 | 0.0003 |
| highest.degree.attained.2017Bachelor’s Degree | 22897.515 | 3199.716 | 7.156 | 0.0000 |
| highest.degree.attained.2017Master’s Degree | 31909.517 | 3752.446 | 8.504 | 0.0000 |
| highest.degree.attained.2017Professional Degree (DDS, JD, MD) | 49677.561 | 7593.481 | 6.542 | 0.0000 |
| occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL | 15840.727 | 2325.775 | 6.811 | 0.0000 |
| occupation.2017MANAGEMENT RELATED | 14217.206 | 2822.122 | 5.038 | 0.0000 |
| occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS | 18240.352 | 3657.971 | 4.986 | 0.0000 |
| occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS | 26028.188 | 5759.531 | 4.519 | 0.0000 |
| occupation.2017PHYSICAL SCIENTISTS | -11756.214 | 7315.070 | -1.607 | 0.1082 |
| occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS | 12498.840 | 7576.065 | 1.650 | 0.0992 |
| occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS | 565.146 | 4083.932 | 0.138 | 0.8900 |
| occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS | 11658.653 | 6866.191 | 1.698 | 0.0897 |
| occupation.2017TEACHERS | -2421.154 | 3088.787 | -0.784 | 0.4332 |
| occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS | -12875.246 | 6337.332 | -2.032 | 0.0424 |
| occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS | 9454.633 | 5095.827 | 1.855 | 0.0637 |
| occupation.2017MEDIA AND COMMUNICATION WORKERS | 4977.600 | 5568.606 | 0.894 | 0.3715 |
| occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS | 20431.728 | 3572.076 | 5.720 | 0.0000 |
| occupation.2017HEALTH CARE TECHNICAL AND SUPPORT | -1188.352 | 3468.331 | -0.343 | 0.7319 |
| occupation.2017PROTECTIVE SERVICE | 8840.900 | 4133.087 | 2.139 | 0.0326 |
| occupation.2017FOOD PREPARATIONS AND SERVING RELATED | -823.155 | 3989.492 | -0.206 | 0.8366 |
| occupation.2017CLEANING AND BUILDING SERVICE | -1213.177 | 4774.095 | -0.254 | 0.7994 |
| occupation.2017PERSONAL CARE AND SERVICE WORKERS | -9802.235 | 3694.125 | -2.653 | 0.0080 |
| occupation.2017SALES AND RELATED WORKERS | 7607.294 | 2793.989 | 2.723 | 0.0065 |
| occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS | 3954.315 | 4264.581 | 0.927 | 0.3539 |
| occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS | 6977.528 | 3724.404 | 1.873 | 0.0612 |
| occupation.2017PRODUCTION AND OPERATING WORKERS | -3558.503 | 5684.112 | -0.626 | 0.5314 |
| occupation.2017SETTER, OPERATORS, AND TENDERS | 1336.069 | 4216.560 | 0.317 | 0.7514 |
| occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS | -2503.235 | 3272.196 | -0.765 | 0.4444 |
| children.total1 | 2974.023 | 1558.332 | 1.908 | 0.0565 |
| children.total2 | 1565.810 | 1556.161 | 1.006 | 0.3145 |
| children.total3 | 3022.314 | 1951.261 | 1.549 | 0.1216 |
| children.total4 | 1230.200 | 3697.200 | 0.333 | 0.7394 |
| children.total5 | -9984.058 | 5950.261 | -1.678 | 0.0936 |
| industry.2017CONSTRUCTION | 7034.963 | 3775.896 | 1.863 | 0.0626 |
| industry.2017MANUFACTURING | 6512.208 | 2983.576 | 2.183 | 0.0292 |
| industry.2017WHOLESALE TRADE | 10525.835 | 4061.888 | 2.591 | 0.0096 |
| industry.2017RETAIL TRADE | -4246.831 | 3087.397 | -1.376 | 0.1692 |
| industry.2017TRANSPORTATION AND WAREHOUSING | 10015.124 | 3796.602 | 2.638 | 0.0084 |
| industry.2017INFORMATION AND COMMUNICATION | 6221.390 | 5316.231 | 1.170 | 0.2421 |
| industry.2017FINANCE, INSURANCE, AND REAL ESTATE | 6825.431 | 2783.743 | 2.452 | 0.0143 |
| industry.2017PROFESSIONAL AND RELATED SERVICES | 1595.929 | 2541.819 | 0.628 | 0.5302 |
| industry.2017ACS SPECIAL CODES | 14390.481 | 3108.566 | 4.629 | 0.0000 |
| industry.2017ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | -6381.503 | 3038.542 | -2.100 | 0.0359 |
| industry.2017OTHER SERVICES | 531.425 | 3553.856 | 0.150 | 0.8812 |
| industry.2017PUBLIC ADMINISTRATION | 9080.857 | 3060.031 | 2.968 | 0.0030 |
In this model the baseline intercept refers to a male (the table reports a sexFemale coefficient, so male is the reference level) with a highest degree of None (which means less than a high school diploma), an occupation in Office and Administrative Support Workers, 0 total children, and a job in the Educational Health and Social Services industry. This individual has a predicted income of $31150.12. As all the variables in this model are categorical, one simply adds the coefficients of the levels that differ from the baseline in order to change the interpretation. For example, if this person were instead female and worked in the construction industry, one would subtract $12400.53 (the sexFemale coefficient) and add $7034.96 (the construction industry coefficient). As such, their predicted income would be $25784.55.
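Because every predictor in this model is categorical, a prediction is the intercept plus the coefficient of each level that differs from the baseline. Since the table reports a sexFemale coefficient, the baseline sex is male. The Python sketch below illustrates this with three coefficients copied from the table; the `predict` helper is a hypothetical name used only for this example.

```python
# Coefficients copied from the regression table above (dollars).
intercept = 31150.119  # predicted income of the baseline person
coef = {
    "sexFemale": -12400.533,
    "industry.2017CONSTRUCTION": 7034.963,
    "occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS": 3954.315,
}

def predict(levels):
    """Predicted income: intercept plus the coefficient of each non-baseline level."""
    return intercept + sum(coef[name] for name in levels)

baseline = predict([])
female_construction_industry = predict(["sexFemale", "industry.2017CONSTRUCTION"])

print(f"baseline: ${baseline:.2f}")
print(f"female, construction industry: ${female_construction_industry:.2f}")
```

Note that the construction industry and the construction trades occupation are separate variables with separate coefficients, so a prediction should use whichever of the two actually changes.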
In this model, the following levels of the variables are significant at the 5% level:
- Gender: Female
- Highest Degree: High School Diploma, and Associate degree and up
- Occupation: Executive, Administrative and Managerial; Management Related; Mathematical and Computer Scientists; Engineers, Architects, and Surveyors; Education, Training, and Library Workers; Health Diagnosis and Treating Practitioners; Protective Service; Personal Care and Service Workers; Sales and Related Workers
- Industry: Manufacturing; Wholesale Trade; Transportation and Warehousing; Finance, Insurance, and Real Estate; ACS Special Codes; Entertainment, Accommodations, and Food Services; Public Administration
As none of the total children levels was statistically significant at the 5% level (the closest, one child, has a p-value of 0.0565), let’s try removing the total children variable from the model and conduct an ANOVA test to see if there is a difference between these two models: one with it and one without it.
## Analysis of Variance Table
##
## Model 1: income ~ sex + highest.degree.attained.2017 + occupation.2017 +
## industry.2017
## Model 2: income ~ sex + highest.degree.attained.2017 + occupation.2017 +
## children.total + industry.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1565 8.3645e+11
## 2 1560 8.3195e+11 5 4.5e+09 1.6876 0.1344
As the ANOVA test gives a p-value of 0.1344, which is greater than 0.05, we conclude that the total children variable does not significantly improve the model and that using the model without the total children variable would be adequate.
Difference in Means for Industry
To look at statistical significance in the income gap between men and women, we create a difference in means visual for industry.
The greatest gender differences in mean income are found in three industries: public administration; finance, insurance, and real estate; and ACS special codes. Furthermore, the gender gaps do not differ significantly between industries, i.e., they may all be of roughly the same size.
Adding Highest Degree with Gender as an Interaction Term
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 29122.500 | 4322.830 | 6.737 | 0.0000 |
| sexFemale | -7546.351 | 5989.590 | -1.260 | 0.2079 |
| highest.degree.attained.2017High School Diploma | 8061.460 | 3697.840 | 2.180 | 0.0294 |
| highest.degree.attained.2017GED | 1539.498 | 4327.055 | 0.356 | 0.7221 |
| highest.degree.attained.2017Associate/Junior College | 14977.253 | 4623.190 | 3.240 | 0.0012 |
| highest.degree.attained.2017Bachelor’s Degree | 24487.805 | 3938.623 | 6.217 | 0.0000 |
| highest.degree.attained.2017Master’s Degree | 36036.210 | 4927.199 | 7.314 | 0.0000 |
| highest.degree.attained.2017Professional Degree (DDS, JD, MD) | 66732.303 | 11158.016 | 5.981 | 0.0000 |
| occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL | 15725.311 | 2332.803 | 6.741 | 0.0000 |
| occupation.2017MANAGEMENT RELATED | 14007.383 | 2838.636 | 4.935 | 0.0000 |
| occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS | 17937.501 | 3668.245 | 4.890 | 0.0000 |
| occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS | 25140.670 | 5812.275 | 4.325 | 0.0000 |
| occupation.2017PHYSICAL SCIENTISTS | -12033.529 | 7321.272 | -1.644 | 0.1005 |
| occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS | 12720.662 | 7587.410 | 1.677 | 0.0938 |
| occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS | 734.936 | 4099.695 | 0.179 | 0.8578 |
| occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS | 12309.761 | 6874.780 | 1.791 | 0.0736 |
| occupation.2017TEACHERS | -2187.736 | 3106.671 | -0.704 | 0.4814 |
| occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS | -12654.468 | 6352.197 | -1.992 | 0.0465 |
| occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS | 9350.604 | 5103.843 | 1.832 | 0.0671 |
| occupation.2017MEDIA AND COMMUNICATION WORKERS | 5090.576 | 5595.179 | 0.910 | 0.3631 |
| occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS | 20607.523 | 3609.104 | 5.710 | 0.0000 |
| occupation.2017HEALTH CARE TECHNICAL AND SUPPORT | -1084.312 | 3474.870 | -0.312 | 0.7550 |
| occupation.2017PROTECTIVE SERVICE | 9253.575 | 4169.016 | 2.220 | 0.0266 |
| occupation.2017FOOD PREPARATIONS AND SERVING RELATED | -875.810 | 3997.626 | -0.219 | 0.8266 |
| occupation.2017CLEANING AND BUILDING SERVICE | -1264.085 | 4788.769 | -0.264 | 0.7918 |
| occupation.2017PERSONAL CARE AND SERVICE WORKERS | -9889.507 | 3702.539 | -2.671 | 0.0076 |
| occupation.2017SALES AND RELATED WORKERS | 7526.588 | 2800.234 | 2.688 | 0.0073 |
| occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS | 4278.379 | 4309.142 | 0.993 | 0.3209 |
| occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS | 7244.627 | 3767.122 | 1.923 | 0.0546 |
| occupation.2017PRODUCTION AND OPERATING WORKERS | -3631.021 | 5706.412 | -0.636 | 0.5247 |
| occupation.2017SETTER, OPERATORS, AND TENDERS | 1246.310 | 4231.079 | 0.295 | 0.7684 |
| occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS | -2130.291 | 3314.378 | -0.643 | 0.5205 |
| children.total1 | 3043.007 | 1559.458 | 1.951 | 0.0512 |
| children.total2 | 1559.399 | 1558.995 | 1.000 | 0.3173 |
| children.total3 | 3015.890 | 1959.927 | 1.539 | 0.1241 |
| children.total4 | 924.775 | 3715.019 | 0.249 | 0.8034 |
| children.total5 | -10521.018 | 5964.271 | -1.764 | 0.0779 |
| industry.2017CONSTRUCTION | 7647.922 | 3790.084 | 2.018 | 0.0438 |
| industry.2017MANUFACTURING | 7126.171 | 3003.763 | 2.372 | 0.0178 |
| industry.2017WHOLESALE TRADE | 10995.679 | 4069.815 | 2.702 | 0.0070 |
| industry.2017RETAIL TRADE | -3895.169 | 3094.730 | -1.259 | 0.2083 |
| industry.2017TRANSPORTATION AND WAREHOUSING | 10523.842 | 3805.112 | 2.766 | 0.0057 |
| industry.2017INFORMATION AND COMMUNICATION | 6811.518 | 5344.396 | 1.275 | 0.2027 |
| industry.2017FINANCE, INSURANCE, AND REAL ESTATE | 7299.267 | 2792.241 | 2.614 | 0.0090 |
| industry.2017PROFESSIONAL AND RELATED SERVICES | 2056.298 | 2553.970 | 0.805 | 0.4209 |
| industry.2017ACS SPECIAL CODES | 14347.390 | 3112.140 | 4.610 | 0.0000 |
| industry.2017ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | -6243.466 | 3054.412 | -2.044 | 0.0411 |
| industry.2017OTHER SERVICES | 975.589 | 3563.656 | 0.274 | 0.7843 |
| industry.2017PUBLIC ADMINISTRATION | 9474.393 | 3069.222 | 3.087 | 0.0021 |
| sexFemale:highest.degree.attained.2017High School Diploma | -4189.375 | 6163.425 | -0.680 | 0.4968 |
| sexFemale:highest.degree.attained.2017GED | -2675.755 | 7204.382 | -0.371 | 0.7104 |
| sexFemale:highest.degree.attained.2017Associate/Junior College | -5546.329 | 7238.672 | -0.766 | 0.4437 |
| sexFemale:highest.degree.attained.2017Bachelor’s Degree | -4529.489 | 6333.575 | -0.715 | 0.4746 |
| sexFemale:highest.degree.attained.2017Master’s Degree | -8776.521 | 7361.676 | -1.192 | 0.2334 |
| sexFemale:highest.degree.attained.2017Professional Degree (DDS, JD, MD) | -31525.150 | 14904.481 | -2.115 | 0.0346 |
As several of the levels of the education variable were statistically significant, as was the sexFemale coefficient, we decided to run an interaction model. However, only one of the interaction terms (those that contain a colon, near the bottom of the table) is statistically significant at the 5% level: Professional Degree, with a p-value of 0.0346.
Therefore, we run an ANOVA below to determine whether the interaction between highest degree and sex is significant as a whole.
## Analysis of Variance Table
##
## Model 1: income ~ sex + highest.degree.attained.2017 + occupation.2017 +
## children.total + industry.2017 + sex:highest.degree.attained.2017
## Model 2: income ~ sex + highest.degree.attained.2017 + occupation.2017 +
## children.total + industry.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1554 8.2894e+11
## 2 1560 8.3195e+11 -6 -3006162269 0.9393 0.4656
As the ANOVA test provided a p-value of 0.4656, this tells us that the new interaction model is not better at predicting income than the old model. We know this as the p-value is greater than 0.05. As such, it is better to simply use the older model.
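Even though the interaction model is not retained, it helps to see how its coefficients would be read. In an interaction model the sexFemale coefficient is the gap at the baseline degree level only; at any other level, the implied gap is that coefficient plus the matching interaction coefficient. The Python sketch below works this out from the values in the table above. It only combines point estimates and says nothing about their uncertainty, which is large for most levels.

```python
# Coefficients copied from the interaction model table above (dollars).
sex_female = -7546.351  # gap at the baseline degree level (None)

# sexFemale:highest.degree.attained.2017 interaction coefficients
interaction = {
    "High School Diploma": -4189.375,
    "GED": -2675.755,
    "Associate/Junior College": -5546.329,
    "Bachelor's Degree": -4529.489,
    "Master's Degree": -8776.521,
    "Professional Degree (DDS, JD, MD)": -31525.150,
}

# Implied female-minus-male difference at each degree level:
# the main effect plus that level's interaction coefficient.
gap_by_degree = {"None": sex_female}
for degree, extra in interaction.items():
    gap_by_degree[degree] = sex_female + extra

for degree, gap in gap_by_degree.items():
    print(f"{degree}: {gap:.2f}")
```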
Adding Spouse Income
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 27224.225 | 3725.551 | 7.307 | 0.0000 |
| sexFemale | -13708.517 | 1339.199 | -10.236 | 0.0000 |
| highest.degree.attained.2017High School Diploma | 5531.115 | 2952.943 | 1.873 | 0.0612 |
| highest.degree.attained.2017GED | 533.256 | 3443.581 | 0.155 | 0.8770 |
| highest.degree.attained.2017Associate/Junior College | 11006.864 | 3527.716 | 3.120 | 0.0018 |
| highest.degree.attained.2017Bachelor’s Degree | 19540.780 | 3142.300 | 6.219 | 0.0000 |
| highest.degree.attained.2017Master’s Degree | 28452.589 | 3679.541 | 7.733 | 0.0000 |
| highest.degree.attained.2017Professional Degree (DDS, JD, MD) | 44981.343 | 7487.233 | 6.008 | 0.0000 |
| occupation.2017EXECUTIVE, ADMINISTRATIVE AND MANAGERIAL | 15457.104 | 2287.817 | 6.756 | 0.0000 |
| occupation.2017MANAGEMENT RELATED | 12888.630 | 2782.063 | 4.633 | 0.0000 |
| occupation.2017MATHEMATICAL AND COMPUTER SCIENTISTS | 19438.892 | 3603.611 | 5.394 | 0.0000 |
| occupation.2017ENGINEERS, ARCHITECTS, AND SURVEYORS | 25293.579 | 5656.872 | 4.471 | 0.0000 |
| occupation.2017PHYSICAL SCIENTISTS | -12155.443 | 7197.949 | -1.689 | 0.0915 |
| occupation.2017SOCIAL SCIENTISTS AND RELATED WORKERS | 9277.854 | 7444.613 | 1.246 | 0.2129 |
| occupation.2017COUNSELORS, SOCIAL, AND RELIGIOUS WORKERS | 1364.657 | 4014.649 | 0.340 | 0.7340 |
| occupation.2017LAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS | 11160.126 | 6749.957 | 1.653 | 0.0985 |
| occupation.2017TEACHERS | -1521.828 | 3025.818 | -0.503 | 0.6151 |
| occupation.2017EDUCATION, TRAINING, AND LIBRARY WORKERS | -12408.687 | 6217.108 | -1.996 | 0.0461 |
| occupation.2017ENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS | 8123.612 | 5016.040 | 1.620 | 0.1055 |
| occupation.2017MEDIA AND COMMUNICATION WORKERS | 3721.082 | 5477.310 | 0.679 | 0.4970 |
| occupation.2017HEALTH DIAGNOSIS AND TREATING PRACTITIONERS | 20287.964 | 3514.028 | 5.773 | 0.0000 |
| occupation.2017HEALTH CARE TECHNICAL AND SUPPORT | -325.224 | 3415.503 | -0.095 | 0.9242 |
| occupation.2017PROTECTIVE SERVICE | 9475.883 | 4061.561 | 2.333 | 0.0198 |
| occupation.2017FOOD PREPARATIONS AND SERVING RELATED | 362.625 | 3919.659 | 0.093 | 0.9263 |
| occupation.2017CLEANING AND BUILDING SERVICE | -394.086 | 4697.288 | -0.084 | 0.9331 |
| occupation.2017PERSONAL CARE AND SERVICE WORKERS | -10455.879 | 3631.951 | -2.879 | 0.0040 |
| occupation.2017SALES AND RELATED WORKERS | 7206.923 | 2748.464 | 2.622 | 0.0088 |
| occupation.2017CONSTRUCTION TRADES AND EXTRACTION WORKERS | 4428.304 | 4191.003 | 1.057 | 0.2908 |
| occupation.2017INSTALLATION, MAINTENANCE, AND REPAIR WORKERS | 7099.678 | 3663.910 | 1.938 | 0.0528 |
| occupation.2017PRODUCTION AND OPERATING WORKERS | -2739.127 | 5589.974 | -0.490 | 0.6242 |
| occupation.2017SETTER, OPERATORS, AND TENDERS | 669.241 | 4149.261 | 0.161 | 0.8719 |
| occupation.2017TRANSPORTATION AND MATERIAL MOVING WORKERS | -2430.390 | 3218.669 | -0.755 | 0.4503 |
| spouse.income | 0.156 | 0.021 | 7.365 | 0.0000 |
| industry.2017CONSTRUCTION | 6778.906 | 3709.183 | 1.828 | 0.0678 |
| industry.2017MANUFACTURING | 7993.682 | 2934.881 | 2.724 | 0.0065 |
| industry.2017WHOLESALE TRADE | 12041.343 | 3992.170 | 3.016 | 0.0026 |
| industry.2017RETAIL TRADE | -3635.109 | 3029.722 | -1.200 | 0.2304 |
| industry.2017TRANSPORTATION AND WAREHOUSING | 10891.397 | 3733.747 | 2.917 | 0.0036 |
| industry.2017INFORMATION AND COMMUNICATION | 7991.496 | 5231.182 | 1.528 | 0.1268 |
| industry.2017FINANCE, INSURANCE, AND REAL ESTATE | 7858.467 | 2737.953 | 2.870 | 0.0042 |
| industry.2017PROFESSIONAL AND RELATED SERVICES | 2124.860 | 2490.778 | 0.853 | 0.3937 |
| industry.2017ACS SPECIAL CODES | 15379.004 | 3057.980 | 5.029 | 0.0000 |
| industry.2017ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES | -5880.957 | 2982.318 | -1.972 | 0.0488 |
| industry.2017OTHER SERVICES | 1683.793 | 3495.890 | 0.482 | 0.6301 |
| industry.2017PUBLIC ADMINISTRATION | 9969.962 | 3008.193 | 3.314 | 0.0009 |
For our last regression, we decided to add spouse income to the previous model (the one without total children). We see that the spouse.income coefficient of 0.156 has a p-value that rounds to 0, which is significant at the 5% level.
## Analysis of Variance Table
##
## Model 1: income ~ sex + highest.degree.attained.2017 + occupation.2017 +
## industry.2017
## Model 2: income ~ sex + highest.degree.attained.2017 + occupation.2017 +
## spouse.income + industry.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1565 8.3645e+11
## 2 1564 8.0841e+11 1 2.8041e+10 54.25 2.844e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To be certain of our results, we conduct an ANOVA test. Our p-value of 2.844e-13 is also significant at the 5% level.
We ran the diagnostic plots below to compare the fit of our models.
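The agreement between the coefficient's t test and the ANOVA is not a coincidence: when a single numeric coefficient is added to a model, the nested-model F statistic equals the square of that coefficient's t value. The Python sketch below checks this with the rounded values from the tables above and also shows how the coefficient is read; the $10000 difference in spouse income is a hypothetical figure chosen only for illustration.

```python
# Values copied from the spouse income regression and ANOVA tables above.
t_spouse_income = 7.365  # t value reported for the spouse.income coefficient
f_reported = 54.25       # F reported by the ANOVA comparing the two models

# When one numeric coefficient is added, F equals the square of its t value.
f_from_t = t_spouse_income ** 2

# Reading the coefficient: predicted change in own income for a
# hypothetical $10000 difference in spouse income.
coef_spouse_income = 0.156
change_per_10000 = coef_spouse_income * 10000

print(f"t^2 = {f_from_t:.2f} (reported F = {f_reported})")
print(f"${change_per_10000:.0f} per $10000 of spouse income")
```

The squared t value matches the reported F up to rounding of the printed t value.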
Comparing Goodness of Fit Between Models via Diagnostic Plots
Now that we have done multiple iterations of our regression, and have determined what variables to add, it would be useful to compare the fit of our data between our first and final regression.
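The first model we are comparing is the regression of income on sex:
\(\text{Income} = \beta_0 + \beta_1 \cdot \text{sex}\)
The second model that we are comparing is the regression of income on sex, highest education attained, occupation, spouse income, and industry, where each categorical variable stands for its full set of level coefficients:
\(\text{Income} = \beta_0 + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{highest education attained} + \beta_3 \cdot \text{occupation} + \beta_4 \cdot \text{spouse income} + \beta_5 \cdot \text{industry}\)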
Diagnostics for Model 1




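Residuals vs. Fitted
The residuals versus fitted plot indicates that the residuals do not have constant variance. The two distinct lines of points are expected here: with sex as the only predictor, the model produces only two fitted values, one for men and one for women. The plot therefore says little about linearity, but the unequal spread of the residuals suggests that this simple model is not a good description of the data.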
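To see why a regression on a single two-level predictor gives exactly two fitted values, consider the minimal Python sketch below. The eight data points are hypothetical toy values, not taken from the survey, and the least-squares formulas are written out by hand so that the example needs only the standard library.

```python
from statistics import mean

# Hypothetical toy data, NOT from the survey: (is_female, income)
data = [(0, 52000), (0, 61000), (0, 48000), (0, 70000),
        (1, 39000), (1, 45000), (1, 36000), (1, 50000)]

x = [sex for sex, _ in data]
y = [income for _, income in data]

# Ordinary least squares with one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Only two distinct fitted values exist (the two group means), so a
# residuals-vs-fitted plot shows two vertical strips of points.
print(sorted(set(fitted)))
```

In this toy example the two fitted values are simply the mean income of each group, which is what any regression on one binary indicator produces.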
Normal QQ plot
This plot compares the quantiles of the model’s residuals with those of a normal distribution, and it indicates that the residuals are not normally distributed. Especially at the upper tail, the quantiles of the residuals do not overlay those of the normal distribution. This indicates that our p-values may not be reliable.
Scale-location plot
The scale-location plot is similar to the residuals versus fitted plot: the points again fall along two distinct lines. Ideally, the plot would show a roughly horizontal line with the points spread evenly around it, which would indicate constant variance.
Residuals vs Leverage
The residuals versus leverage plot for Model 1 does not even show a Cook’s distance line. Therefore there are no observations that have both a large residual and high leverage. This means there are no influential outliers in this set.
Diagnostics for Model 2



Residuals vs. Fitted
The residuals in this plot have a less defined pattern than in the first model. They also have a more constant variance, which indicates that a linear model is more appropriate for Model 2 than for Model 1.
Normal QQ plot
As in the Normal QQ plot for Model 1, the tails of the residual distribution are not perfectly aligned with those of the normal distribution. The lower tail of Model 2’s Normal QQ plot is more closely aligned with the normal distribution, but since the upper tail still departs from it, we should interpret the p-values with caution.
Scale-location plot
The scale-location plot for Model 2 does not match the ideal of a horizontal line, which would indicate constant variance of the residuals. Compared to Model 1, it has a less discernible pattern, but the line is still not horizontal.
Residuals vs Leverage
Unlike in Model 1, the Cook’s distance line shows up in Model 2. This means that Model 2 has observations with higher leverage and larger residuals than Model 1. However, since no observation falls beyond the Cook’s distance line, that is, none combines high leverage with a large residual, we do not have any influential outliers in Model 2.
Discussion
Main Conclusions
Through our analysis, we sought to answer the following question:
Is there a significant difference in income between men and women? Does the difference vary depending on other factors?
In response to the first question, we found that the answer was yes. Men, on average, earn $14354.72 more than women. More precisely, the average man in the US makes $56108.74 and the average woman makes $41754.02. This means that the average woman makes about 74 cents for every dollar a man makes. This linear regression also had a p-value that rounds to 0, which shows that the difference is significant at the 5% level.
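The headline figures follow directly from the two mean incomes. The short Python sketch below reproduces the arithmetic, using only the two means reported above; the variable names are chosen for this illustration.

```python
# Mean incomes reported above (dollars).
mean_men = 56108.74
mean_women = 41754.02

gap_dollars = mean_men - mean_women  # raw difference in means
ratio = mean_women / mean_men        # women's earnings per dollar earned by men
gap_share = 1 - ratio                # gap as a share of men's mean income

print(f"gap: ${gap_dollars:.2f}")
print(f"ratio: {ratio:.2f} (about {ratio * 100:.0f} cents per dollar)")
print(f"gap share: {gap_share:.0%}")
```

The gap share of roughly 26% is the unadjusted income gap that the later models try to explain.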
As for the second question, the difference does appear to vary depending on other factors, though not very much. Even when accounting for highest degree attained, occupation, industry, and spouse income, we were only able to reduce the income gap by $646.20, which translates to a reduction of about 3 percentage points, from a 26% income gap to a 23% income gap. This still leaves a 23% gap unaccounted for by our current model.
On the one hand, this could provide quantitative evidence for gender discrimination against women. On the other hand, this could simply mean there are other confounding variables we have yet to account for. Some examples of these variables include:
- Work hours: Women may tend to prefer more flexible work hours than men
- Salary negotiation: Men may be more likely than women to negotiate their salaries up
- Maternity leave: Maternity leave is also more widely used than paternity leave and there is no federal law mandating paid maternity leave
As such, we conclude that, as 23% of the wage gap is still unexplained by our model, we are not confident that we have accounted for all possible confounding variables, and so we cannot be certain that this remaining gap is caused by discrimination.
Limitations and Confidence
The limitations of our analysis were shown via our diagnostic plots. For our model of income on gender alone, while we found that the result was statistically and practically significant, the diagnostic plots showed that our model lacked constant variance, thereby reducing our confidence in our p-value. For our model of gender with highest degree, occupation, spouse income, and industry, while this was an improvement on the earlier model in terms of constant variance, there is still room for improvement, as our scale-location plot was not as horizontal as it ideally would have been.
Overall, however, our initial model is believable, as our finding of 26% for the overall wage gap is near estimates from other researchers. Our second model is less believable, as other researchers have been able to include other variables and explain a greater part of the wage gap than we have. As such, we would not feel confident presenting our analysis to policymakers. If we were to present to policymakers now without accounting for confounding variables, our main policy recommendation would likely be quite different from the one we would make had we accounted for all possible confounding variables.
On a final note, it is clear that more research and more legislation are needed. If we were to estimate the lifetime earnings a woman loses because of the wage gap, we conservatively estimate this to be around $682950. This makes our findings quite practically significant.